transition point
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > Canada (0.04)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.32)
Predicting the Formation of Induction Heads
Aoyama, Tatsuya, Wilcox, Ethan Gotlieb, Schneider, Nathan
Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet a precise characterization of their formation remains elusive. In this study, we investigate the relationship between statistical properties of training data (for both natural and synthetic data) and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find a precise Pareto frontier in terms of these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.
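To make "surface bigram repetition frequency" concrete, here is a minimal Python sketch. It assumes one plausible reading of the term: the fraction of adjacent token pairs in a context that already occurred earlier in that same context. The function name and this definition are illustrative assumptions; the paper's exact measure may differ.

```python
def bigram_repetition_frequency(tokens):
    """Fraction of adjacent token pairs that already appeared earlier in `tokens`.

    Assumed, illustrative definition; not taken from the paper.
    """
    seen = set()
    repeated = 0
    total = 0
    for prev, cur in zip(tokens, tokens[1:]):
        total += 1
        if (prev, cur) in seen:
            repeated += 1
        seen.add((prev, cur))
    # An empty or single-token context has no bigrams at all.
    return repeated / total if total else 0.0


if __name__ == "__main__":
    context = "a b c a b c".split()
    print(bigram_repetition_frequency(context))
```

In the sample context, the five bigrams are ab, bc, ca, ab, bc, and the last two are repeats.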
- Europe > Austria > Vienna (0.14)
- South America > Paraguay > Asunción > Asunción (0.04)
- Europe > Norway > Norwegian Sea (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.47)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Phase Transitions between Accuracy Regimes in L2 regularized Deep Neural Networks
Ersoy, Ibrahim Talha, Wiesner, Karoline
Increasing the L2 regularization of Deep Neural Networks (DNNs) causes a first-order phase transition into the under-parametrized phase -- the so-called onset-of-learning. We explain this transition via the scalar (Ricci) curvature of the error landscape. We predict new transition points as the data complexity is increased and, in accordance with the theory of phase transitions, the existence of hysteresis effects. We confirm both predictions numerically. Our results provide a natural explanation of the recently discovered phenomenon of 'grokking' as DNN models getting stuck in a local minimum of the error surface, corresponding to a lower accuracy phase. Our work paves the way for new probing methods of the intrinsic structure of DNNs in and beyond the L2 context.
- Europe > Germany > Brandenburg > Potsdam (0.05)
- North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
DeepAtlas: a tool for effective manifold learning
Hughes, Serena, Hamilton, Timothy, Kolokotrones, Tom, Deeds, Eric J.
Manifold learning builds on the "manifold hypothesis," which posits that data in high-dimensional datasets are drawn from lower-dimensional manifolds. Current tools generate global embeddings of data, rather than the local maps used to define manifolds mathematically. These tools also cannot assess whether the manifold hypothesis holds true for a dataset. Here, we describe DeepAtlas, an algorithm that generates lower-dimensional representations of the data's local neighborhoods, then trains deep neural networks that map between these local embeddings and the original data. Topological distortion is used to determine whether a dataset is drawn from a manifold and, if so, its dimensionality. Application to test datasets indicates that DeepAtlas can successfully learn manifold structures. Interestingly, many real datasets, including single-cell RNA-sequencing, do not conform to the manifold hypothesis. In cases where data is drawn from a manifold, DeepAtlas builds a model that can be used generatively and promises to allow the application of powerful tools from differential geometry to a variety of datasets.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > Texas > Harris County > Houston (0.04)
- (2 more...)
- Education (0.72)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.68)
MoCap-Impute: A Comprehensive Benchmark and Comparative Analysis of Imputation Methods for IMU-based Motion Capture Data
Bekhit, Mahmoud, Salah, Ahmad, Alrawahi, Ahmed Salim, Attia, Tarek, Ali, Ahmed, Eldesokey, Esraa, Fathalla, Ahmed
Motion capture (MoCap) data from wearable Inertial Measurement Units (IMUs) is vital for applications in sports science, but its utility is often compromised by missing data. Despite numerous imputation techniques, a systematic performance evaluation for IMU-derived MoCap time-series data is lacking. We address this gap by conducting a comprehensive comparative analysis of statistical, machine learning, and deep learning imputation methods. Our evaluation considers three distinct contexts: univariate time-series, multivariate across subjects, and multivariate across kinematic angles. To facilitate this benchmark, we introduce the first publicly available MoCap dataset designed specifically for imputation, featuring data from 53 karate practitioners. We simulate three controlled missingness mechanisms: missing completely at random (MCAR), block missingness, and a novel value-dependent pattern at signal transition points. Our experiments, conducted on 39 kinematic variables across all subjects, reveal that multivariate imputation frameworks consistently outperform univariate approaches, particularly for complex missingness. For instance, multivariate methods achieve up to a 50% mean absolute error reduction (MAE from 10.8 to 5.8) compared to univariate techniques for transition point missingness. Advanced models like Generative Adversarial Imputation Networks (GAIN) and Iterative Imputers demonstrate the highest accuracy in these challenging scenarios. This work provides a critical baseline for future research and offers practical recommendations for improving the integrity and robustness of MoCap data analysis.
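Two of the missingness mechanisms named above, MCAR and block missingness, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the use of `None` as the missing marker, and the fixed-count MCAR sampling are assumptions, not the benchmark's actual code. The value-dependent transition-point pattern is not specified in the abstract and is left out.

```python
import random


def mask_mcar(series, rate, seed=0):
    """Hide a fixed fraction of positions chosen uniformly at random (MCAR)."""
    rng = random.Random(seed)
    k = round(rate * len(series))
    missing = set(rng.sample(range(len(series)), k))
    return [None if i in missing else v for i, v in enumerate(series)]


def mask_block(series, start, length):
    """Hide one contiguous run of samples, e.g. a sensor dropout."""
    return [None if start <= i < start + length else v
            for i, v in enumerate(series)]


if __name__ == "__main__":
    angles = [float(i) for i in range(20)]  # stand-in for one kinematic angle
    print(mask_mcar(angles, rate=0.25, seed=1))
    print(mask_block(angles, start=5, length=4))
```

Seeding the generator keeps the masks reproducible, so different imputation methods can be compared on identical gaps.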
- Oceania > Australia > New South Wales > Sydney (0.14)
- Asia > Middle East > Oman > Ad Dakhiliyah Governorate > Nizwa (0.04)
- Africa > Middle East > Egypt > Sharqia Governorate > Zagazig (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Leisure & Entertainment > Sports (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.67)
When can in-context learning generalize out of task distribution?
Goddard, Chase, Smith, Lindsay M., Ngampruetikorn, Vudtiwat, Schwab, David J.
In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize out-of-distribution. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- North America > Canada (0.04)
NeUQI: Near-Optimal Uniform Quantization Parameter Initialization
Lin, Li, Hu, Xinyu, Wan, Xiaojun
Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored for its efficiency and ease of deployment since uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on $\geq 2$-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they primarily focus on quantization methodologies, while the initialization of quantization parameters is underexplored and still relies on the suboptimal Min-Max strategies. In this work, we propose NeUQI, a method devoted to efficiently determining near-optimal initial parameters for uniform quantization. NeUQI is orthogonal to prior quantization methodologies and can seamlessly integrate with them. The experiments with the LLaMA and Qwen families on various tasks demonstrate that our NeUQI consistently outperforms existing methods. Furthermore, when combined with a lightweight distillation strategy, NeUQI can achieve superior performance to PV-tuning, a much more resource-intensive approach.
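For context, the Min-Max initialization that the abstract calls suboptimal can be sketched as follows. This is the textbook asymmetric uniform quantizer, not NeUQI itself; the function names and the pure-Python list representation are assumptions made for illustration.

```python
def minmax_quant_params(weights, bits):
    """Min-Max initialization: map [min, max] onto the integer grid 0..2^bits - 1."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0:
        scale = 1.0  # constant tensor: any positive scale works
    zero_point = round(-w_min / scale)
    return scale, zero_point


def quantize(weights, scale, zero_point, bits):
    """Round each weight to its nearest grid code, clamped to the valid range."""
    levels = 2 ** bits - 1
    return [min(levels, max(0, round(w / scale) + zero_point)) for w in weights]


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [(q - zero_point) * scale for q in codes]


if __name__ == "__main__":
    w = [-1.0, 0.0, 1.0, 2.0]
    scale, zp = minmax_quant_params(w, bits=2)
    codes = quantize(w, scale, zp, bits=2)
    print(scale, zp, codes, dequantize(codes, scale, zp))
```

Because the scale is set entirely by the two extreme values, a single outlier stretches the grid for every other weight, which is one common reason Min-Max is considered a weak starting point at low bit widths.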
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > Colorado > Denver County > Denver (0.04)
- (2 more...)
Small Models, Smarter Learning: The Power of Joint Task Training
Both, Csaba, Hoover, Benjamin, Strobelt, Hendrik, Krotov, Dmitry, Weidele, Daniel Karl I., Martino, Mauro, Dehmamy, Nima
The ability of a model to learn a task depends strongly on both the task difficulty and the model size. We aim to understand how task difficulty relates to the minimum number of parameters required for learning specific tasks in small transformer models. Our study focuses on the ListOps dataset, which consists of nested mathematical operations. We gradually increase task difficulty by introducing new operations or combinations of operations into the training data. We observe that sum modulo n is the hardest to learn. Curiously, when combined with other operations such as maximum and median, the sum operation becomes easier to learn and requires fewer parameters. We show that joint training not only improves performance but also leads to qualitatively different model behavior. We show evidence that models trained only on SUM might be memorizing and fail to capture the number structure in the embeddings. In contrast, models trained on a mixture of SUM and other operations exhibit number-like representations in the embedding space, and a strong ability to distinguish parity. Furthermore, the SUM-only model relies more heavily on its feedforward layers, while the jointly trained model activates the attention mechanism more. Finally, we show that learning pure SUM can be induced in models below the learning threshold of pure SUM, by pretraining them on MAX+MED. Our findings indicate that emergent abilities in language models depend not only on model size, but also on the training curriculum.
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Asia > Middle East > Jordan (0.04)
- Africa > Mali (0.04)
A biconvex method for minimum-time motion planning through sequences of convex sets
Marcucci, Tobia, Halm, Mathew, Yang, Will, Lee, Dongchan, Marchese, Andrew D.
We consider the problem of designing a smooth trajectory that traverses a sequence of convex sets in minimum time, while satisfying given velocity and acceleration constraints. This problem is naturally formulated as a nonconvex program. To solve it, we propose a biconvex method that quickly produces an initial trajectory and iteratively refines it by solving two convex subproblems in alternation. This method is guaranteed to converge, returns a feasible trajectory even if stopped early, and does not require the selection of any line-search or trust-region parameter. Exhaustive experiments show that our method finds high-quality trajectories in a fraction of the time of state-of-the-art solvers for nonconvex optimization. In addition, it achieves runtimes comparable to industry-standard waypoint-based motion planners, while consistently designing lower-duration trajectories than existing optimization-based planners.

Selecting the most effective motion-planning algorithm for a robotic system often requires balancing three competing objectives: reliability, computational efficiency, and trajectory quality. Consider Sparrow, the robot arm in Figure 1 that sorts individual products into bins before they get packaged in the Amazon warehouses. The algorithms that move Sparrow must be extremely reliable, as these robots handle millions of diverse products every day, and each failure requires expensive interventions. They must be efficient, since every millisecond spent planning is taken away from other crucial computations, and limits the robot reactivity to sensor observations. Finally, they should generate trajectories that push the robot to its physical limits, so that the work-cell throughput is maximized and the hardware is fully utilized. Unfortunately, general-purpose methods for motion planning do not excel in all of these areas at once.
Sampling-based methods like PRM [18], RRT [19], and their asymptotically optimal versions [17] can be fast enough for real-time applications. They are also reliable in low-dimensional spaces, where dense sampling is computationally feasible. However, they become significantly less effective as the space dimension grows. Additionally, although their kino-dynamic variants support differential constraints [20, 16, 22], sampling-based methods remain considerably less practical for designing smooth continuous trajectories than producing polygonal paths. Trajectory-optimization methods based on nonconvex programming [1, 33] scale well to high-dimensional spaces and explicitly factor in the robot kinematics and dynamics. Over the years, these techniques have become significantly faster [39, 13] and, with the advent of specialized GPU implementations [35], they are now even viable for real-time motion planning.
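The idea of solving two convex subproblems in alternation can be illustrated on a toy biconvex function. The sketch below is not the paper's trajectory-planning method: the objective f(x, y) = (xy - 1)^2 + reg (x^2 + y^2), the function names, and the closed-form updates are assumptions chosen only to show why the objective never increases between steps.

```python
def objective(x, y, reg=0.1):
    """Toy biconvex function: convex in x for fixed y, and in y for fixed x."""
    return (x * y - 1.0) ** 2 + reg * (x * x + y * y)


def best_response(other, reg=0.1):
    """Exact minimizer over one variable with the other held fixed.

    Setting the derivative 2*other*(v*other - 1) + 2*reg*v to zero gives
    v = other / (other**2 + reg).
    """
    return other / (other * other + reg)


def alternate(x, y, iters=200, reg=0.1):
    """Alternate exact minimization over x and y, recording the objective."""
    history = [objective(x, y, reg)]
    for _ in range(iters):
        x = best_response(y, reg)   # convex subproblem in x
        history.append(objective(x, y, reg))
        y = best_response(x, reg)   # convex subproblem in y
        history.append(objective(x, y, reg))
    return x, y, history


if __name__ == "__main__":
    x, y, history = alternate(2.0, 0.5)
    print(x, y, history[0], history[-1])
```

Each update is the exact minimizer of its subproblem, so every recorded objective value is no larger than the previous one, and stopping early still returns the best point found so far. This mirrors, in miniature, the descent property the abstract describes.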
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- North America > United States > Massachusetts (0.04)
- North America > United States > Iowa (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)